The goal of this notebook is to compare pseudobulk and bulk
calculations to determine which pseudobulk calculation should we proceed
with for modeling: the log2 of the sum of raw counts
(pseudobulk_log_counts) or the
DESeq2-normalized sum of raw counts
(pseudobulk_deseq). To this end, we’ll visualize expression
distributions, both on their own and compared to bulk TPM.
renv::load()
library(ggplot2)
theme_set(theme_bw())
data_dir <- here::here("analysis", "pseudobulk-bulk-prediction", "data")
tpm_dir <- file.path(data_dir, "tpm")
pseudobulk_dir <- file.path(data_dir, "pseudobulk")
tpm_files <- list.files(
path = tpm_dir,
full.names = TRUE,
pattern = "-tpm\\.tsv$"
)
tpm_names <- stringr::str_split_i(basename(tpm_files), pattern = "-", i = 1)
names(tpm_files) <- tpm_names
pseudobulk_files <- list.files(
path = pseudobulk_dir,
full.names = TRUE,
pattern = "-pseudobulk\\.tsv$"
)
pseudobulk_names <- stringr::str_split_i(basename(pseudobulk_files), pattern = "-", i = 1)
names(pseudobulk_files) <- pseudobulk_names
# Make sure we have the same projects, in the same order
stopifnot(
all.equal(names(tpm_files), names(pseudobulk_files))
)
We’ll make both a long and wide version of the data for convenience throughout the notebook.
project_long_df_list <- purrr::map2(
tpm_files,
pseudobulk_files,
\(tpm_file, pseudo_file) {
dplyr::bind_rows(
# TPM needs to be in log2 space
readr::read_tsv(tpm_file, show_col_types = FALSE) |>
dplyr::mutate(expression = log2(expression)),
readr::read_tsv(pseudo_file, show_col_types = FALSE)
)
}
)
# Make a wide version as well
project_wide_df_list <- project_long_df_list |>
purrr::map(
\(df) {
df |>
tidyr::pivot_wider(names_from = expression_type, values_from = expression)
}
)
First, let’s get a general sense of the scale of values in the pseudobulk quantities:
project_wide_df_list |>
purrr::map(
\(df) {
df |>
# consider only genes with values not lost to transformation
dplyr::filter(is.finite(pseudobulk_log_counts)) |>
dplyr::pull(pseudobulk_log_counts) |>
summary()
}
)
$SCPCP000001
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 2.322 5.322 5.301 8.134 19.707
$SCPCP000002
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 2.322 5.322 5.328 8.134 20.643
$SCPCP000006
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 2.000 4.700 4.846 7.375 20.125
$SCPCP000009
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 2.000 4.700 4.850 7.418 21.389
$SCPCP000017
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 2.000 5.285 5.377 8.480 21.760
project_wide_df_list |>
purrr::map(
\(df) summary(df$pseudobulk_deseq)
)
$SCPCP000001
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.831 0.000 0.000 2.014 4.627 20.689
$SCPCP000002
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.609 0.000 0.000 2.066 4.653 19.238
$SCPCP000006
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.5231 -0.2927 0.0000 2.0203 4.8461 18.2792
$SCPCP000009
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.5802 0.0000 0.2317 2.5097 5.2150 19.8548
$SCPCP000017
Min. 1st Qu. Median Mean 3rd Qu. Max.
-2.5783 -0.2888 0.0000 2.2524 5.2874 22.8423
Due to the different transformation approaches, the
pseudobulk_deseq version has some negatives for fractional
values, but pseudobulk_log_counts has a lower bound of 0.
These different scales of data are important to keep in mind.
Due to the log2 transformation and sparsity of scRNA-seq
counts, many values in pseudobulk_log_counts are expected
to be -Inf, aka essentially unobserved in the single-cell
data. We’ll look at histograms, per project, of the percentage of genes
with defined:
finite_df <- project_wide_df_list |>
purrr::map(
\(df) {
df |>
dplyr::count(sample_id, is_finite = is.finite(pseudobulk_log_counts)) |>
tidyr::pivot_wider(
names_from = is_finite,
values_from = n,
names_prefix = "finite_",
values_fill = 0
) |>
dplyr::mutate(percent_genes_with_expression = finite_TRUE/(finite_FALSE + finite_TRUE))
}) |>
purrr::list_rbind(names_to = "project_id")
ggplot(finite_df) +
aes(x = percent_genes_with_expression, fill = project_id) +
geom_histogram(bins = 15) +
facet_wrap(vars(project_id), nrow = 1) +
theme(legend.position = "none")
It looks like about half the genes, but ranging from ~20 - 60%, of genes in a given sample are non-zero. This means about 40-80% of genes in the genome are not observed per sample.
What are the pseudobulk_deseq values where the
pseudobulk_log_counts values (above) are
-Inf?
deseq_inf_df <- project_wide_df_list |>
purrr::map(
\(df) dplyr::filter(df, is.infinite(pseudobulk_log_counts))
)|>
purrr::list_rbind(names_to = "project_id")
ggplot(deseq_inf_df) +
aes(x = pseudobulk_deseq, fill = project_id) +
geom_histogram(bins = 15) +
facet_wrap(vars(project_id), nrow = 1) +
theme(legend.position = "none")
So, lots of the 0s are here which we expect to see. Are all the 0s here?
If all the 0s are the same as all the infinites, then there should be no rows where 0 and finite co-occur:
project_wide_df_list |>
purrr::map(
\(df) {
df |>
# when deseq is 0, is log_counts always -inf?
dplyr::filter(
pseudobulk_deseq == 0,
is.finite(pseudobulk_log_counts)
) |>
nrow()
})
$SCPCP000001
[1] 0
$SCPCP000002
[1] 0
$SCPCP000006
[1] 0
$SCPCP000009
[1] 0
$SCPCP000017
[1] 0
Indeed, pseudobulk_log_counts is never finite (aka, it
is always -Inf) when pseudobulk_deseq is 0,
which is good to know.
Let’s also check if we have the same genes between bulk and
single-cell. Any genes present in one modality but not the other will
have NA expression (transforming any 0 expressions will
have produced -Inf, not NA, so those
NAs are from joining).
Are any in single-cell but not bulk?
project_wide_df_list |>
purrr::map(\(df){
df |>
dplyr::filter(!is.na(bulk_tpm),
is.na(pseudobulk_deseq),
is.na(pseudobulk_log_counts))
}) |>
dplyr::bind_rows()
Nope! How about single-cell but not bulk?
only_single_cell <- project_wide_df_list |>
purrr::map(\(df){
df |>
dplyr::filter(is.na(bulk_tpm),
!is.na(pseudobulk_deseq),
!is.na(pseudobulk_log_counts))
}) |>
purrr::list_rbind(names_to = "project_id") |>
dplyr::count(project_id, ensembl_id)
only_single_cell
Yup! Plus, these numbers are all the sample numbers…
only_single_cell |>
dplyr::count(ensembl_id)
..and it’s the same genes all round (probably a reference thing…?).
In any case, what is the expression of these genes in pseudobulk?
genes_not_in_bulk <- project_wide_df_list |>
purrr::map(
\(df) df |> dplyr::filter(is.na(bulk_tpm))
) |>
purrr::list_rbind(names_to = "project_id") |>
dplyr::select(-bulk_tpm) |>
tidyr::pivot_longer(
contains("pseudobulk"),
names_to = "expression_type",
values_to = "expression"
)
ggplot(genes_not_in_bulk) +
aes(x = expression_type, y = expression) +
geom_boxplot() +
facet_wrap(vars(project_id), nrow = 1) +
theme(axis.text.x = element_text(angle = 30, hjust = 1))
Warning: Removed 4142 rows containing non-finite outside the scale range
(`stat_boxplot()`).
Some of these genes actually have decent values in single-cell.
Here, we visualize distributions of all quantities:
project_long_df_list |>
purrr::imap(
\(df, project_id) {
ggplot(df) +
aes(x = expression, fill = expression_type) +
geom_density(alpha = 0.5) +
scale_fill_brewer(palette = "Dark2") +
facet_wrap(vars(expression_type), scales = "free", nrow = 4) +
ggtitle(project_id) +
theme(legend.position = "none")
}
) |>
patchwork::wrap_plots(guides = "collect", nrow = 1)
This section will look at the relationship between bulk and pseudobulk quantities with:
bulk ~ pseudobulk to do a
cursory comparison of model statistics
What does the relationship look like between bulk and each flavor of pseudobulk? Plots are organized by project and:
bulk tpm ~ deseq pseudocountsbulk tpm ~ log_counts pseudocountsproject_wide_df_list |>
purrr::imap(
\(df, project_id) {
p1 <- ggplot(df) +
aes(x = pseudobulk_deseq, y = bulk_tpm) +
geom_point(alpha = 0.2, size = 0.5) +
geom_smooth(method = "lm") +
facet_wrap(vars(sample_id), nrow = 5) +
ggtitle("bulk tpm ~ deseq")
p2 <- ggplot(df) +
aes(x = pseudobulk_log_counts, y = bulk_tpm) +
geom_point(alpha = 0.2, size = 0.5) +
geom_smooth(method = "lm") +
facet_wrap(vars(sample_id), nrow = 5) +
ggtitle("bulk tpm ~ logcounts")
patchwork::wrap_plots(p1, p2)
}
) |>
patchwork::wrap_plots(nrow = 5)
Let’s now get some stats for each sample. We’ll fit a linear model for each sample, and display some quantities below both as boxplots and the full table.
plot_stats <- function(df, column, title) {
ggplot(df) +
aes(x = expression_type, y = {{column}}, color = expression_type) +
geom_boxplot() +
scale_color_brewer(palette = "Dark2") +
ggtitle(title) +
facet_wrap(vars(project_id), nrow = 1) +
theme(legend.position = "none")
}
model_samples <- function(id, df) {
sample_df <- df |>
dplyr::filter(sample_id == id)
df_deseq <- sample_df |>
dplyr::filter(is.finite(pseudobulk_deseq), is.finite(bulk_tpm))
fit_deseq <- lm(bulk_tpm ~ pseudobulk_deseq, data = df_deseq)
df_log_counts <- sample_df |>
dplyr::filter(is.finite(pseudobulk_log_counts), is.finite(bulk_tpm))
fit_log_counts <- lm(bulk_tpm ~ pseudobulk_log_counts, data = df_log_counts)
# Tabulate and return some fit stats
data.frame(
expression_type = c("deseq", "log_counts"),
rsquared = c(broom::glance(fit_deseq)$r.squared, broom::glance(fit_log_counts)$r.squared),
coeff = c(broom::tidy(fit_deseq)$estimate[2], broom::tidy(fit_log_counts)$estimate[2]),
residual_sd = c(broom::glance(fit_deseq)$sigma, broom::glance(fit_log_counts)$sigma)
)
}
stats_df <- project_wide_df_list |>
purrr::map(
\(df) {
# We need to map over sample ids now
samples <- unique(df$sample_id)
names(samples) <- samples
fit_table <- samples |>
purrr::map(model_samples, df) |>
purrr::list_rbind(names_to = "sample_id")
return(fit_table)
}
) |>
# now, combine all projects into a single table
purrr::list_rbind(names_to = "project_id")
patchwork::wrap_plots(
plot_stats(stats_df, rsquared, "rsquared"),
plot_stats(stats_df, coeff, "coeff"),
plot_stats(stats_df, residual_sd, "residual_sd"),
nrow = 3
)
pseudobolk_deseq tends to outperform
pseudobolk_log_counts for all projects except
SCPCP000009, but notably there are only 3 samples for this
projectSCPCP000017, and SCPCP000001 and
SCPCP000002 show the strongest relationships overallAll the actual values are here:
stats_df
sessionInfo()
R version 4.4.0 (2024-04-24)
Platform: aarch64-apple-darwin20
Running under: macOS 15.2
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/New_York
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggplot2_3.5.1
loaded via a namespace (and not attached):
[1] sass_0.4.9 generics_0.1.3 tidyr_1.3.1
[4] renv_1.0.11 stringi_1.8.4 lattice_0.22-6
[7] hms_1.1.3 digest_0.6.37 magrittr_2.0.3
[10] evaluate_1.0.1 grid_4.4.0 RColorBrewer_1.1-3
[13] fastmap_1.2.0 Matrix_1.7-1 rprojroot_2.0.4
[16] jsonlite_1.8.9 backports_1.5.0 BiocManager_1.30.25
[19] mgcv_1.9-1 purrr_1.0.2 scales_1.3.0
[22] jquerylib_0.1.4 cli_3.6.3 rlang_1.1.4
[25] crayon_1.5.3 bit64_4.5.2 munsell_0.5.1
[28] splines_4.4.0 withr_3.0.2 cachem_1.1.0
[31] yaml_2.3.10 tools_4.4.0 parallel_4.4.0
[34] tzdb_0.4.0 dplyr_1.1.4 colorspace_2.1-1
[37] here_1.0.1 broom_1.0.7 vctrs_0.6.5
[40] R6_2.5.1 lifecycle_1.0.4 stringr_1.5.1
[43] bit_4.5.0.1 vroom_1.6.5 pkgconfig_2.0.3
[46] pillar_1.10.0 bslib_0.8.0 gtable_0.3.6
[49] glue_1.8.0 xfun_0.49 tibble_3.2.1
[52] tidyselect_1.2.1 knitr_1.49 farver_2.1.2
[55] htmltools_0.5.8.1 nlme_3.1-166 patchwork_1.3.0
[58] rmarkdown_2.29 labeling_0.4.3 readr_2.1.5
[61] compiler_4.4.0